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Abstract 

We construct classifiers for multivariate and functional data. Our approach is 
based on a kind of distance between data points and classes. The distance measure 
needs to be robust to outliers and invariant to linear transformations of the data. 
For this purpose we can use the bagdistance which is based on halfspace depth. It 
satisfies most of the properties of a norm but is able to reflect asymmetry when the 
class is skewed. Alternatively we can compute a measure of outlyingness based on 
the skew-adjusted projection depth. In either case we propose the DistSpace trans¬ 
form which maps each data point to the vector of its distances to all classes, followed 
by fc-nearest neighbor (kNN) classification of the transformed data points. This com¬ 
bines invariance and robustness with the simplicity and wide applicability of kNN. The 
proposal is compared with other methods in experiments with real and simulated data. 


1 Introduction 


Supervised classification of multivariate data is a common statistical problem. One is 
given a training set of observations and their membership to certain groups (classes). 
Based on this information, one must assign new observations to these groups. Examples 
of classification rules include, but are not limited to, linear and quadratic discriminant 
analysis, fc-nearest neighbors (kNN), support vector machines, and decision trees. For an 
overview see e.g. Hastie et al. (2009). 

However, real data often contain outlying observations. Outliers can be caused by 
recording errors or typing mistakes, but they may also be valid observations that were 
sampled from a different population. Moreover, in supervised classification some obser¬ 
vations in the training set may have been mislabeled, i.e. attributed to the wrong group. 
To reduce the potential effect of outliers on the data analysis and to detect them, many 


robust methods have been developed, see e.g. Rousseeuw and Leroy (1987) and Maronna 


et al. (2006). 


Many of the classical and robust classification methods rely on distributional assump¬ 


tions such as multivariate normality or elliptical symmetry (Hubert and Van Driessen 


2004). Most robust approaches that can deal with more general data make use of the con¬ 


cept of depth, which measures the centrality of a point relative to a multivariate sample. 
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The first type of depth was the halfspace depth of Tukey (1975), followed by other depth 


functions such as simplicial depth (Liu 1990) and projection depth (Zuo and Serfling 

20001 ). 


Several authors have used depth in the context of classification. Christmann and 


Rousseeuw (2001) and Christmann et al. (2002) applied regression depth (Rousseeuw and 
Hubert 


, 1999). The maximum depth classification rule of 

Liu 

(1990 

) was studied by 

Ghosh 


and Chaudhuri (2005) and extended by Li et al. (2012). Dutta and Ghosh (2011) used 


projection depth. 

In this paper we will present a novel technique called classification in distance space. 
It aims to provide a fully non-parametric tool for the robust supervised classification of 
possibly skewed multivariate data. In Sections [2] and [3] we will describe the key concepts 
needed for our construction. Section [4] discusses some existing multivariate classifiers and 
introduces our approach. A thorough simulation study for multivariate data is performed 
in Section [5j From Section [6] onwards we will focus our attention on the increasingly 
important framework of functional data, the analysis of which is a rapidly growing field. 
We will start by a general description, and then extend our work on multivariate classifiers 
to functional classifiers. 


2 Multivariate depth and distance measures 

2.1 Halfspace depth 

If Y is a random variable on M p with distribution Py, then the halfspace depth of any 
point x E relative to Py is defined as the minimal probability mass contained in a 
closed halfspace with boundary through x: 

HD(*; P Y ) = inf Py {v'Y v'x} . 

IMI=i 


Halfspace depth satisfies the requirements of a statistical depth function as formulated by 

: it is affine invariant (i.e. invariant to translations and nonsingular 
linear transformations), it attains its maximum value at the center of symmetry if there 
is one, it is monotone decreasing along rays emanating from the center, and it vanishes at 
infinity. 

For any statistical depth function D and for any a E [0,1] the a-depth region D a is 
the set of points whose depth is at least a: 


Zuo and Serfling (2000 


D a = {x E M p ; D(«; Py) a} . 


(1) 


The boundary of D a is known as the a-depth contour. The halfspace depth regions are 
closed, convex, and nested for increasing a. Several properties of the halfspace depth 


function and its contours were studied in Masse and Theodorescu (1994) and Rousseeuw 


and Ruts (1999). The halfspace median (or Tukey median) is defined as the center of 


gravity of the smallest non-empty depth region, i.e. the region containing the points with 
maximal halfspace depth. 
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The finite-sample definitions of the halfspace depth, the Tukey median and the depth 
regions are obtained by replacing Py by the empirical probability distribution P n . Many 
finite-sample properties, including the breakdown value of the Tukey median, were derived 


m 


Donoho and Gasko (1992). 


To compute the halfspace depth, several affine invariant algorithms have been devel¬ 
oped. Rousseeuw and Ruts (1996) and Rousseeuw and Struyf (1998) provided exact algo¬ 


rithms in two and three dimensions and an approximate algorithm in higher dimensions. 


Recently Dyckerhoff and Mozharovskyi (2016) developed exact algorithms in higher dimen¬ 


sions. Algorithms to compute the halfspace median have been developed by Rousseeuw 


the algorithms constructed by 
applicable to at least p = 5. 


and Ruts (1998) and Struyf and Rousseeuw (2000). To compute the depth contours the 


algorithm of Ruts and Rousseeuw (1996) can be used in the bivariate setting, whereas 

d2010b and 


Hallin et al. 


Paindaveine and Siman 


(2012) are 


2.2 The bagplot 


The bagplot of Rousseeuw et al. (1999) generalizes the univariate boxplot to bivariate 


data, as illustrated in Figure [lj The dark-colored bag is the smallest depth region with at 
least 50% probability mass, i.e. B = such that Py(B ) ^ 0.5 and Py(D a ) < 0.5 for 
all a > a. The white region inside the bag is the smallest depth region, which contains 
the halfspace median (plotted as a red diamond). The fence, which itself is rarely drawn, 
is obtained by inflating the bag by a factor 3 relative to the median, and the data points 
outside of it are flagged as outliers and plotted as stars. The light-colored loop is the 
convex hull of the data points inside the fence. 



Figure 1: Bagplot of a bivariate dataset. 

The bagplot exposes several features of the bivariate data distribution: its center (by 
the Tukey median), its dispersion and shape (through the sizes and shape of the bag and 
the loop) and the presence or absence of outliers. In Figure [l] we see a moderate deviation 
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from symmetry as well as several observations that lie outside the fence. One could extend 
the notion of bagplot to higher dimensions as well, but a graphical representation then 
becomes harder or impossible. 


2.3 The bagdistance 


Although the halfspace depth is small in outliers, it does not tell us how distant they are 
from the center of the data. Also note that any point outside the convex hull of the data 
has zero halfspace depth, which is not so informative. Based on the concept of halfspace 
depth, we can however derive a statistical distance of a multivariate point x G M p to Py 


as in (Hubert et al., 2015). This distance uses both the center and the dispersion of Py. 
To account for the dispersion it uses the bag B defined above. Next, c(x) := c x is defined 
as the intersection of the boundary of B and the ray from the halfspace median 6 through 
x. The bagdistance of * to Y is then given by the ratio of the Euclidean distance of x to 
0 and the Euclidean distance of c x to 6: 


bd(x;Py) | || a ._0||/|| Ca ._0|| elsewhere. ^ 

The denominator in ([2]) accounts for the dispersion of Py in the direction of x. Note that 
the bagdistance does not assume symmetry and is affine invariant. 



Figure 2: Illustration of the bagdistance between an arbitrary point and a sample. 

The finite-sample definition is similar and illustrated in Figure [2] for the data set in 
Figure [lj Now the bag is shown in gray. For two new points X\ and *2 their Euclidean 
distance to the halfspace median is marked by dark blue lines, whereas the orange lines 
correspond to the denominator of ([2]) and reflect how these distances will be scaled. Here 
the lengths of the blue lines are the same (although they look different as the scales of 
the coordinates axes are quite distinct). On the other hand the bagdistance of x\ is 7.43 
and that of X 2 is only 0.62. These values reflect the position of the points relative to the 
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sample, one lying quite far from the most central half of the data and the other one lying 
well within the central half. 

Similarly we can compute the bagdistance of the outliers. For the uppermost right 
outlier with coordinates (3855,305) we obtain 4.21, whereas the bagdistance of the lower 
outlier (3690,146) is 3.18. Both distances are larger than 3, the bagdistance of all points 
on the fence, but the bagdistance now reflects the fact that the lower outlier is merely a 
boundary case. The upper outlier is more distant, but still not as remote as x\. 

We will now provide some properties of the bagdistance. We define a generalized norm 
as a function g : MP —> [0, oo[ such that g(0) = 0 and g(x ) 0 for * / 0, which satisfies 

g('yx) = 7 g(x) for all x and all 7 > 0. In particular, for a positive definite p x p matrix 
S it holds that 

g(x) = \J x'H,~ l x (3) 

is a generalized norm (and even a norm). 

Now suppose we have a compact set B which is star-shaped about zero, i.e. for all 
x E B and 0 ^ 7 ^ 1 it holds that 72 : E B. For every i/Owe then construct the point 
c x as the intersection between the boundary of B and the ray emanating from 0 in the 
direction of x. Let us assume that 0 is in the interior of B, that is, there exists e > 0 such 
that the ball B(0,e) C B. Then ||ca.|| > 0 whenever x / 0. Now define 


s(*) 



if x = 0 
otherwise. 


(4) 


Note that we do not really need the Euclidean norm, as we can equivalently define g{x) 
as inf {7 > 0; 7 _ 1 a; E B}. We can verify that g(-) is a generalized norm, which need not 
be a continuous function. The following result shows more. 


Theorem 1. If the set B is convex and compact and 0 E int(B) then the function g 
defined in Q is a convex function and hence continuous. 

Proof. We need to show that 


g(Xx + (1 - \)y) sC Xg(x) + (1 - X)g(y) (5) 

for any x, y E and 0 ^ A ^ 1. In case {0, x, y} are collinear the function g restricted to 
this line is 0 in the origin and goes up linearly in both directions (possibly with different 
slopes) so ([5]) is satisfied for those x and y. If {0 ,x,y} are not collinear they form 
a triangle. Note that we can write x = g(x)c x and y = g(y)c y and we will denote 
z := Xx T (1 — A )y. We can verify that 2 := (A g(x) + (1 — A )g{y))~ 1 z is a convex 
combination of c x and c y . By compactness of B we know that c x ,c y E B, and from 
convexity of B it then follows that 2 E B. Therefore 


= c? 




so that finally 

9( z ) = ^ [pi = X 9(x) + (1 - X)g{y) . □ 

ll c z|| 11*11 
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Note that this result generalizes Theorem 2 of Hubert et al. (2015) from halfspace depth 


to general convex sets. It follows that g satisfies the triangle inequality since 


g(x + y) = 2g(^x + ^ y ) ^ 2 \g{x) + 2 ^g(y) = g(x) + g(y) 


Therefore g (and thus the bagdistance) satisfies the conditions 

(i) g(x) ^ 0 for all x 6 

(ii) g(x) = 0 implies x = 0 

(iii) g( 72 :) = 7 g(x) for all x € M p and 7^0 

(iv) g(x + y) ^ g(x) + g(y) for all x,y . 

This is almost a norm, in fact, it would become a norm if we were to add 


g(—x) = g(x) for all x £ M p . 


The generalization makes it possible for g to reflect asymmetric dispersion. (We could 
easily turn it into a norm by computing h(x) = (g(x) 4 - g(—x))/2 but then we would lose 
that ability.) 

Also note that the function g defined in Q does generalize the Mahalanobis distance 
in <§> as can be seen by taking B = jx; x'H~x ^ 1} which implies c x = (x'Xl l x) 1 / 2 x 
for all x / 0 so g{x) = ||x|| /((x'S ^ 1 ®)^ 1 / 2 ||x||) = Vx'Y,~ 1 x . 

Finally note that Theorem [l] holds whenever B is a convex set. Instead of halfspace 
depth we could also use regions of projection depth or the depth function in Section [3| On 
the other hand, if we wanted to describe nonconvex data regions we would have to switch 
to a different star-shaped set B in Q. 

In the univariate case, the compact convex set B in Theorem [I] becomes a closed 
interval which we can denote by B = [— with a, b > 0, so that 

g(x) = ax + + bx~ . 


In linear regression the minimization of ^” = i < 7 ( 77 ) yields the a/{a + b ) regression quantile 
of Koenker and Bassett (1978). 

It is straightforward to extend Theorem [l] to a nonzero center by subtracting the center 
first. 

To compute the bagdistance of a point x with respect to a p-variate sample we can 
first compute the bag and then the intersection point c x . In low dimensions computing 
the bag is feasible, and it is worth the effort if the bagdistance needs to be computed for 
many points. In higher dimensions computing the bag is harder, and then a simpler and 
faster algorithm is to search for the multivariate point c* on the ray from 6 through x 
such that 

HD (c*-P n ) = med {HD(y 4 ; P n )} (6) 


where y L are the data points. Since HD is monotone decreasing on the ray this can be 
done fairly fast, e.g. by means of the bisection algorithm. 

Table[l]lists the computation time needed to calculate the bagdistance of m G {1, 50,100,1000} 
points with respect to a sample of n = 100 points in dimensions p 6 {2, 3, 4, 5}. For p = 2 


6 









the algorithm of Ruts and Rousseeuw (1996) is used and ([6]) otherwise. The times are 
averages over 1000 randomly generated data sets. In each of the 1000 runs the points were 
generated from a centered multivariate normal distribution with a randomly generated 
covariance matrix. Note that the time for m = 1 is essentially that of the right hand side 

of del). 


Table 1: Computation times for the bagdistance (n = 100), in units of 0.001 seconds. 


p 

1 

50 

m 

100 

1000 

2 

15.6 

16.2 

17.4 

17.1 

3 

34.8 

67.8 

84.1 

310.2 

4 

45.3 

88.3 

107.9 

377.3 

5 

56.4 

106.3 

128.2 

432.8 


3 Skew-adjusted projection depth 

Since the introduction of halfspace depth various other affine invariant depth functions 
have been defined (for an overview see e.g. Mosler (2013)), among which projection depth 
(Zuo, 2003) which is essentially the inverse of the Stahel-Donoho outlyingness (SDO). The 
population SDO (Stahel, 1981 Donoho, 1982) of an arbitrary point x with respect to a 
random variable Y with distribution Py is defined as 

v'x — med(i/y) | 


SDO(*;Py)= sup 
|MI=i 

from which the projection depth is derived: 

PD(ai; P Y ) = 


MAD(dT) 


1 


1 + SDO (x-P Y ) ' 

Since the SDO has an absolute deviation in the numerator and uses the MAD in its de¬ 


nominator it is best suited for symmetric distributions. For asymmetric distributions Brys 


et al. (2005) proposed the adjusted outlyingness (AO) in the context of robust independent 


component analysis. It is defined as 

AO(x;Py)= sup AOi (v'x;P v iy) 

IMI=i 

where the univariate adjusted outlyingness AOi is given by 


z—med(Z) 


AO, l if2r>med(Z) 

AUl [Z ’ ^ Z) ~ \ ^Z)-z lf < , f z) 

{ med(Z)—(z) 11 Z ^ meCl^Z/J . 


(7) 
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Here 

m(Z) = Q X {Z) - 1.5 e _4MC ( z) IQR(Z) 
w 2 (Z) = Q 3 (Z) + 1.5 e + 3MC ( z ) IQR(Z) 


if MC (Z) ^ 0, where Qi(Z ) and Q^(Z) denote the first and third quartile of Z, IQR(Z) = 
Q 3 (Z)—Qi(Z) and MC (Z) is robust measure of skewness (Brys et ah, 2004). If MC(Z) < 0 
we replace (z,Z) by (—z,—Z). The de nominator of (|7 ) corresponds to the fence of the 
univariate adjusted boxplot proposed by Hubert and Vandervieren (2008). 

The skew-adjusted projection depth (SPD) is then given by ( |Hubert et al. 2015): 

SPD(»;iV)= 1+A0 1 (x;fv) . 


To compute the finite-sample SPD we have to rely on approximate algorithms, as it is 
infeasible to consider all directions v. A convenient affine invariant procedure is obtained 
by considering directions v which are orthogonal to an affine hyperplane through p ran¬ 
domly drawn data points. In our implementation we use 250p directions. Table [2] shows 
the time needed to compute the AO (or SPD) of m £ {1, 50,100,1000} points with respect 
to a sample of n = 100 points in dimensions p £ {2, 3,4, 5}, as in Table [I] Here the time 
for m = 1 is the fixed cost of computing those 250p directions and projecting the original 
data on them. 


Table 2: Computation times for the AO (n = 100), in units of 0.001 seconds. 


p 

1 

50 

m 

100 

1000 

2 

15.0 

15.3 

15.6 

20.9 

3 

23.2 

23.9 

23.5 

31.3 

4 

30.5 

30.9 

31.6 

41.7 

5 

38.4 

39.1 

40.0 

52.2 


We see that computing AO is much faster than computing the bagdistance (Table [I]), 
and that this difference becomes more pronounced at larger p and m. This is mainly due 
to the fact that AO does not require to compute the deepest point in multivariate space, 
unlike the bagdistance ([2]) which requires G. 


4 Multivariate classifiers 

4.1 Existing methods 

One of the oldest nonparametric classifiers is the fc-nearest neighbor (kNN) method intro¬ 


duced by Fix and Hodges (1951). For each new observation the method looks up the k 


training data points closest to it (typically in Euclidean distance), and then assigns it to 
the most prevalent group among those neighbors. The value of k is typically chosen by 
cross-validation to minimize the misclassification rate. 
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Liu (1990) proposed to assign a new observation to the group in which it has the highest 
depth. This MaxDepth rule is simple and can be applied to more than two groups. On the 
other hand it often yields ties when the depth function is identically zero on large domains, 


as is the case with halfspace depth and simplicial depth. 

Dutta and Ghosh 

(20111 

) avoided 

this problem by using projection depth instead, whereas 

Hubert and Van der Veeken 


(2010) employed the skew-adjusted projection depth. 


To improve on the MaxDepth rule, Li et al. (2012) introduced the DepthDepth classifier 


as follows. Assume that there are two groups, and denote the empirical distributions of 
the training groups as P\ and TV Then transform any data point i S l p to the bivariate 
point 

( depth(x ; Pi), depth(x; P 2 )) (8) 

where depth is a statistical depth function. These bivariate points form the so-called 
depth-depth plot , in which the two groups of training points are colored differently. The 
classification is then performed on this plot. The MaxDepth rule corresponds to separating 


according to the 45 degree line through the origin, but in general Li et al. (2012) calculate 
the best separating polynomial. Next, they assign a new observation to group 1 if it 
lands above the polynomial, and to group 2 otherwise. Some disadvantages of the depth- 
depth rule are the computational complexity of finding the best separating polynomial 
and the need for majority voting when there are more than two groups. Other authors 


carry out a depth transform followed by linear classification (Lange et al., 2014) or kNN 


(Cuesta-Albertos et al., 2015) instead 


4.2 Classification in distance space 


It has been our experience that distances can be very useful in classification, but we prefer 
not to give up the affine invariance that depth enjoys. Therefore, we propose to use the 
bagdistance of Section T3 for this purpose, or alternatively the adjusted outlyingness of 
Section [3} Both are affine invariant, robust against outliers in the training data, and 
suitable also for skewed data. 

Suppose that G groups (classes) are given, where G ^ 2. Let P g represent the empirical 
distribution of the training data from group g = 1,... ,G. Instead of the depth transform 
([8]) we now carry out a distance transform by mapping each point x £ to the G -variate 
point 

(dist(x-, Pi),..., dist(x] Pq)) (9) 


where dist{x-,P g ) is a generalized distance or an outlyingness measure of the point x to 
the g -th training sample. Note that the dimension G may be lower, equal, or higher than 
the original dimension p. After the distance transform any multivariate classifier may be 
applied, such as linear or quadratic discriminant analysis. The simplest version is of course 
MinDist, which just assigns x to the group with smallest coordinate in ([ 9 ]). When using 
the Stahel-Donoho or the adjusted outlyingness, this is equivalent to the MaxDepth rule 
based on projection depth or skew-adjusted projection depth. However, we prefer to apply 
kNN to the transformed points. This combines the simplicity and robustness of kNN with 
the affine invariance offered by the transformation. Also note that we never need to resort 
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to majority voting. In the simulations in Section [5] we will see that the proposed DistSpace 
method (i.e. the distance transform <§ followed by kNN) works quite well. 



Figure 3: Scatterplot matrix of the banknote authentication data. The authentic bank¬ 
notes are shown in orange, the forged ones in blue. 


We now illustrate the distance transform on a real world example, available from the 
UCI Machine Learning Repository (Bache and Lichman, 2013). The data originated from 
an authentication procedure of banknotes. Photographs of 762 genuine and 610 forged 
banknotes were processed using wavelet transformations, and four features were extracted. 
These are the 4 coordinates shown in the scatterplot matrix in Figure [3] 

Note that G = 2. Using the bagdistance, the distance space of this data is Figure |4j 
It shows that forged and authentic banknotes are well-separated and that the authentic 
banknotes form a tight cluster compared to that of the forged ones. Any new banknote 
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Figure 4: Distance-distance plot of the banknote authentication data, 
would yield a new point in this plot, allowing kNN to classify it. 

5 Computational results 

To evaluate the various classifiers we apply them to simulated and real data. Their per¬ 
formance is measured by their average misclassification percentage "Yh g= i e g n g/N with e g 
the percentage of misclassified observations of group g in the test set, n g the number of 
observations of group g in the training set, and N the total size of the training set. This 
weights the misclassification percentages in the test set according to the prior probabili¬ 
ties. In each scenario the test set consists of 500 observations per group. This procedure 
is repeated 2000 times for each setting. 

Setting 1: Trivariate normals (G = 3, p = 3). We generate data from three 
different normal distributions. The first group C\ has parameters 

/0\ /5 3 1 

/r 1 = I 0 I and Si = I 3 2 1 

V \1 1 3 

The second group is generated like C\ but we flip the sign of the second coordinate. The 
third group is again generated like C\ but then shifted by the vector (1,-2,—4). The 
training data consist of 50 observations in each group. 

Setting 2: Multivariate normal and skewed (G = 2, p = 6). We consider two 
6-variate distributions. The first group C\ is drawn from the standard normal distribu¬ 
tion. The coordinates in the second group are independent draws from the exponential 
distribution with rate parameter 1: 

Ci~N( 0,/ 6 ) and C 2 ~ (Exp(l), Exp(l), Exp(l), Exp(l), Exp(l), Exp(l))'. 
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The training data has 150 observations drawn from group C\ and 100 from C 2 . 

Setting 3: Concentric distributions (G = 2,p = 7). This consists of two groups 
of data. The first group is drawn from the standard normal distribution. The second 
group is obtained by generating points on the unit sphere in and multiplying them by 
lengths which are generated uniformly on [12,13]. The training data has 150 observations 
from group C\ and 250 from C 2 . 

Setting 4: Banknote authentication data (G = 2, p = 4). We first standardize 
the data by the columnwise median and MAD. The training sets are random subsets 
of 500 points from the original data set, with the test sets each time consisting of the 
remaining 872 observations. 

Among the depth-based classification rules, halfspace depth (HD) is compared to pro¬ 
jection depth (PD) and skew-adjusted projection depth (SPD). We run the MaxDepth 
rule, DepthDepth followed by the best separating polynomial and DepthDepth followed by 
kNN. The degree of the polynomial and the number of neighbors k are selected based on 
leave-one-out cross-validation. 

Among the distance-based classifiers, the bagdistance based on halfspace depth ( bd) is 
compared to the Stahel-Donoho outlyingness (SDO) and the adjusted outlyingness (AO). 
Here the MinDist and DistSpace classifiers are considered. 

We evaluate all classifiers on the uncontaminated data, and on data where 5% and 
10 % of the observations in each group are mislabeled by assigning them randomly to 
another group. Figures [5]{8] summarize the results with boxplots of the misclassification 
percentages. 


kNN 

MaxDepth 

DepthDepth + poly 

DepthDepth + kNN 

MinDist 

DistSpace 

A 

I 

1 

1 

wi 


1 4 


. 


kNN HD PD SPD HD PD SPD HD PD SPD bd SDO AO bd SDO AO 


Figure 5: Misclassification percentages in 2000 runs of setting 1 (trivariate normals). The 
results for clean data are shown in gray, for 5% mislabeled data in orange, and for 10% 
mislabeled data in blue. 

In setting 1, most of the depth- and distance-based methods did better than kNN. The 
halfspace depth HD did not perform well in MaxDepth and DepthDepth + kNN , and in 
fact mislabeling improved the classification because it yielded fewer points with depth zero 
in both groups. Halfspace depth appeared to work better in DepthDepth + polynomial but 
this is due to the fact that whenever a point has depth zero in both groups, Li et al.| (2012) 
fall back on kNN in the original data space. Also note that DepthDepth + polynomial by 


L i et al.| (2012 
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construction improves the MaxDepth rule on training data, but it doesn’t always perform 
better on test data. 



Figure 6: Misclassification percentages in 2000 runs of setting 2 (6-variate normal and 
skewed) using the same color code. 

In setting 2 we note the same things about HD in the depth-based methods. The 
best results are obtained by DepthDepth + poly and DistSpace, where we note that the 
methods that are able to reflect skewness (HD, SPD, bd , AO) did a lot better than those 
that aren’t (PD, SDO). This is because the data contains a skewed group. 


fkN 

M MaxDepth 

DepthDepth + poly 

DepthDepth + kNN 

MinDist 

DistSpace 
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4- 
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kNN HD PD SPD 

HD 

PD SPD 

HD PD SPD 

bd SDO AO 

bd SDO AO 


Figure 7: Misclassification percentages in 2000 runs of setting 3 (concentric groups). 

In the third setting one of the groups is not convex at all, and the MaxDepth and 
MinDist boxplots lie entirely above the figure. On the other hand the DepthDepth and 
DistSpace methods still see structure in the data, and yield better results than kNN on 
the original data. 

In the banknote authentication example (setting 4), all methods except HD work well. 
For clean data, the two methods using the bagdistance outperform all others. 
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Figure 8: Misclassification percentages in 2000 runs of setting 4 (banknote data). 


6 Functional data 


The analysis of functional data is a booming research area of statistics, see e.g. the books 
of Ramsay and Silverman (2005) and Ferraty and Vieu (2006). A functional data set 


typically consists of n curves observed at time points t\,... ,tr- The value of a curve at 
a given time point is a p -variate vector of measurements. We call the functional dataset 
univariate or multivariate depending on p. For instance, the multi-lead ECG data set 


analyzed by Pigoli and Sangalli (2012) is multivariate with p = 8. 


When faced with classification of functional data, one approach is to consider it as 
multivariate data in which the measurement(s) at different time points are separate vari¬ 
ables. This yields high-dimensional data with typically many highly correlated variables, 
which can be dealt with by penalization (Hastie et ah, 1995). Another approach is to 


project such data onto a lower-dimensional subspace and to continue with the projected 
data, e.g. by means of support vector machines (Rossi and Villa, 2006 Martin-Barragan 


et ah, 2014). Li and Yu (2008) proposed to use F-statistics to select small subintervals in 


the domain and to restrict the analysis to those. Other techniques include the weighted 


( 2012 ). 


distance method of Alonso et al. (2012) and the componentwise approach of Delaigle et al. 


To reflect the dynamic behavior of functional data one can add their derivatives or 
integrals to the analysis, and/or add some preprocessing functions (warping functions, 
baseline corrections, ...) as illustrated in Claeske ns et al.| (2014). This augments the data 
dimension and may add valuable information that can be beneficial in obtaining a better 
classification. We will illustrate this on a real data set in Section PI 

The study of robust methods for functional data started only recently. So far, efforts 
to construct robust classification rules for functional data have mainly used the concept of 


depth: Lopez-Pintado and Romo (2006) used the modified band depth, Cuesta-Albertos 


and Nieto-Reyes (2010) made use of random Tukey depth, and Hlubinka et al. (2015) 


compared several depth functions in this context. 
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6.1 Functional depths and distances 


) proposed a type of multivariate functional depth (MFD) as follows, 
stochastic process Y = {Y(t),t 6 U}, a statistical depth function 
D(-, •) on M p , and a weight function w on U integrating to 1. Then the MFD of a curve 
X on U with respect to the distribution P y is defined as 


Claeskens et al. (2014 


Consider a p-variate 


MFD (X-P Y )= f D(X(t)-P y{t) )w{t)dt 
Ju 


( 10 ) 


where Py(t) is the distribution of Y at time t. The weight function w(t) allows to emphasize 
or downweight certain time regions, but in this paper will be assumed constant. The 
functional median ©(f) is defined as the curve with maximal MFD. Properties of the 
MFD may be found in (Claeskens et ah, 2014), with emphasis on the case where D(-, •) is 


the halfspace depth. Several consistency results are derived in (Nagy et al. 2016) 


For ease of notation and to draw quick parallels to the multivariate non-functional 
case, we will denote the MFD based on halfspace depth by fHD, and the MFD based on 
projection depth and skew-adjusted projection depth by fPD and fSPD. 

Analogously, we can define the functional bag distance (fbd) of a curve X to (the dis¬ 
tribution of) a stochastic process Y as 


fbd(X-P Y )= [ bd{X{t)-P Y (t))dt . ( 11 ) 

Ju 

Similar extensions of the Stahel-Donoho outlyingness SDO and the adjusted outlyingness 
AO to the functional context are given by 


fSDO {X-P Y ) 
fAO (X-P Y ) 


1 SDO (X(t);P Y (t))dt 
u 

(12) 

[ AO (X(t)-P Y (t))dt. 

)u 

(13) 


6.2 Functional classifiers 

The classifiers discussed in Section [4] are readily adapted to functional data. By simply 
plugging in the functional versions of the distances and depths all procedures can be 
carried over. For the /c-nearest neighbor method one typically uses the L 2 -distance: 


d 2 (X 1 ,X 2 ) = 


r \ V 2 

ljx 1 (t)-x 2 (t)\\ 2 dtj 


The functional kNN method will be denoted as fkNN. It is simple but not affine invariant. 
Analogously we use the MaxDepth and DepthDepth rules based on fHD, fPD, and fSPD, 


as well as the MinDist and DistSpace rules based on fbd , fSDO, and fAO. Note that Mosler 


and Mozharovskyi (2016) already studied DepthDepth on functional data after applying a 


dimension reduction technique. 
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7 Functional data examples 

7.1 Fighter plane dataset 


The fighter plane dataset of Thakoor and Gao (2005) describes 7 shapes: of the Mirage, 
Eurofighter, F-14 with wings closed, F-14 with wings opened, Harrier, F-22 and F-15. 


Each class contains 30 shape samples obtained from digital pictures, which Thakoor and 


Gao (2005) then reduced to the univariate functions in Figure [9j We obtained the data 


from the UCR Time Series Classification Archive (Chen et ah, 2015). 



100 


Figure 9: Functions describing the shapes of fighter planes. 

In all, the plane data set consists of 210 observations divided among 7 groups. For the 
training data we randomly drew 15 observations from each group, and the test data were 
the remaining 105 observations. Repeating this 200 times yielded the misclassification 
percentages in Figure [10} 



Figure 10: Misclassification percentages in 200 runs of the fighter plane data. 

In this data set the DistSpace method performed best, followed by kNN which however 
suffered under 10% of mislabeling. Figure 10 contains no panel for DepthDepth + poly 
because the computation time of this method was infeasible due to the computation of 
the separating polynomials combined with majority voting for G = 7. 
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7.2 MRI dataset 


Felipe et al. (2005) obtained intensities of MRI images of 9 different parts of the human 


body (plus a group consisting of all remaining body regions, which was of course very 
heterogeneous). They then transformed their data to curves. This data set was also 
downloaded from (Chen et al., 2015). The G = 9 classes together contain 547 observations 
and are of unequal size. For example n\ = 112 ,712 = 65 ,713 = 75,.... The curves for 4 
of these classes are shown in Figure 11 (if we plot all 9 groups together, some become 
invisible). 



“i-1-1- r~ 

20 40 60 80 


Figure 11: Curves computed from MRI intensities. 


For the training data we drew unequally sized random subsets from these groups. The 


misclassification rates of 200 experiments of this type are shown in Figure 12 



Figure 12: Misclassification percentages in 200 runs of the MRI data. 

Here DistSpace performs a bit better than fKNN under contamination, and much better 
than MaxDepth and MinDist. Also in this example the DepthDepth + poly method took 
too long to compute. 
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7.3 Writing dataset 


The writing dataset consists of 2858 character samples corresponding to the speed profile 
of the tip of a pen writing different letters, as captured on a WACOM tablet. The data 
came from the UCI Machine Learning Repository (Bache and Lichman 2013). We added 
the x- and (/-coordinates of the pen tip (obtained by integration) to the data, yielding 
p = 4 overall unlike both previous examples which had p = 1. We further processed the 
data by removing the first and last time points and by interpolating to give all curves 
the same time domain. Samples corresponding to the letters ‘a’, ‘c’, ‘e’, ‘h’ and ‘nr’ were 
retained. This yields a five-group supervised classification problem of four-dimensional 
functional data. Figure [l3| plots the curves, with the 5 groups shown in different colors. 



time time 


Figure 13: Coordinates (upper) and speed (lower) of the writing data. Each group has a 
different color. 

For each letter the training set was a random subset of 80 multivariate curves. The 
outcome is in Figure pT) There is no panel for the DepthDepth + poly classifier with sepa¬ 
rating polynomials and majority voting as its computation time was infeasible. MaxDepth 
and DepthDepth combined with kNN perform well except for fHD, again due to the fact 
that HD is zero outside the convex hull. DistSpace outperforms MinDist , and works well 


18 
























fkNN MaxDepth 

DepthDepth + kNN 

MinDist 

DistSpace 

t 

■ - 



I 

! 

m 

,1 

A 


i i 

. 

Efe— 

la 


fkNN fHD fPD fSPD fHD fPD fSPD fbd fSDO fAO fbd fSDO fAO 


Figure 14: Misclassification percentages in 200 runs of the writing data. 


with all three distances. The best result was obtained by DistSpace with fbd. 

Finally we applied fkNN and DistSpace to the original two-dimensional velocity data 
only. This resulted in larger median misclassification errors for all methods and all 3 data 
settings (0%, 5% and 10% mislabeling). For example, DistSpace with fbd on the two- 
dimensional data yielded a median misclassification error of 0.35%, whereas the median 
error was zero on the 4-dimensional augmented data. This shows that adding appropriate 
data-based functional information can be very useful to better separate groups. 


8 Conclusions 


Existing classification rules for multivariate or functional data, like kNN, often work well 
but can fail when the dispersion of the data depends strongly on the direction in which 
it is measured. The MaxDepth rule of Liu (1990) and its DepthDepth extension (Li et al. 


2012 ) resolve this by their affine invariance, but perform poorly in combination with depth 


functions that become zero outside the convex hull of the data, like halfspace depth (HD). 

This is why we prefer to use the bagdistance bd, which is based on HD and has proper¬ 
ties very close to those of a norm but is able to reflect skewness (while still assuming some 
convexity). Rather than transforming the data to their depths we propose the distance 
transform , based on bd or a measure of outlyingness such as SDO or AO. 

After applying the depth or distance transforms there are many possible ways to classify 
the transformed data. We found that the original separating polynomial method did not 
perform the best. Therefore we prefer to apply kNN to the transformed data. 

In our experiments with real and simulated data we found that the best performing 
methods overall were DepthDepth + kNN (except with halfspace depth) and DistSpace + 
kNN. The latter approach combines affine invariance with the computation of a distance 
and the simplicity, lack of assumptions, and robustness of kNN, and works well for both 
multivariate and functional data. 

In the multivariate classification setting the depth and distance transforms perform 
about equally well, and in particular MinDist on SDO and AO is equivalent to MaxDepth 
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on the corresponding depths PD and SPD. But the bagdistance bd beats the halfspace 
depth HD in this respect because the latter is zero outside the convex hull of a group. 

One of the most interesting results of our simulations is that the depth and distance 
transforms are less similar in the functional setting. Indeed, throughout Section [7] the 
distance transform outperformed the depth transform. This is because distances are more 
additive than depths, which matters because of the integrals in the definitions of functional 
depth MFD (10) to functional AO (13). For the sake of simplicity, let us focus on the 
empirical versions where the integrals in (10) to (13) become sums over a finite number of 
observed time points. These sums are L 1 norms (we could also use L 2 norms by taking the 
square root of the sum of squares). In the context of classification, we are measuring how 
different a new curve X is from a process Y or a finite sample from it. When X differs 
strongly from Y in a few time points, the integrated depth (10) will have a few terms equal 
to zero or close to zero, which will not lead to an extremely small sum, so X would appear 
quite similar to Y. On the other hand, a functional distance measure like (|1 1|)—(p~3]) will 
contain a few very large terms, which will have a large effect on the sum, thereby revealing 
that X is quite far from Y. In other words, functional distance adds up information about 
how distinct X is from Y. The main difference between the two approaches is that the 
depth terms are bounded from below (by zero), whereas distance terms are unbounded 
from above and thus better able to reflect discrepancies. 
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